
Conversation

@pragupta (Collaborator) commented Sep 9, 2025

rocm_base: 681e60e

Tested this PR on MI300x using registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16623_ubuntu24.04_py3.12_pytorch_rocm7.1_internal_testing_681e60e1

Ran the following UTs:
test_nn, test_torch, test_cuda, test_ops, test_unary_ufuncs, test_autograd, inductor/test_torchinductor

All ran fine, attaching logs!
default_ut.log

Successful wheel build job with this branch: http://rocm-ci.amd.com/view/preview/job/pytorch2.8-manylinux-wheels-preview/116/

pytorchmergebot and others added 30 commits September 4, 2025 13:13
…lt (pytorch#159889)"

This reverts commit 4ae57d4.

Reverted pytorch#159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](pytorch#159889 (comment)))
On Zen 2 (AMD EPYC) and Intel Sapphire Rapids this fails with small differences when compiled with native-targeted optimizations, e.g. it fails with `-march=znver2` but succeeds with `-march=znver1`.

I assume some operator fusing is being used by GCC. Small differences like using `vmovdqa` can be seen in the minimized code of the baddbmm kernel: https://godbolt.org/z/jsxMa91Wb

The greatest differences are consistent and the same on both CPU architectures:
```
Greatest absolute difference: 3.43852152582258e-05 at index (1, 2, 1) (up to 1e-05 allowed)
Greatest relative difference: 3.6034286949870875e-06 at index (1, 2, 1) (up to 1.3e-06 allowed)
```

Hence I assume this is within the expected tolerances, especially as `complex128` and all other types pass.
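
For reference, a check with tolerances loosened to cover the reported deltas could look like this (a sketch only; the tensors are stand-ins and the actual test change may differ):

```python
import torch

# Illustrative tensors standing in for the baddbmm result and its reference.
actual = torch.randn(2, 3, 4, dtype=torch.complex64)
expected = actual + 3.5e-05  # perturbation on the order of the reported difference

# Default complex64 tolerances (rtol=1.3e-06, atol=1e-05) would reject this;
# slightly loosened tolerances accept it.
torch.testing.assert_close(actual, expected, rtol=4e-06, atol=4e-05)
```
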
Pull Request resolved: pytorch#152424
Approved by: https://github.com/malfet
This reverts commit 90b0864.

Reverted pytorch#160449 on behalf of https://github.com/jeanschmidt due to Already discussed with @ezyang about the internal quirks and errors ([comment](pytorch#160449 (comment)))
Many users want a config that forces all CUDA ops to be captured by cudagraphs; when that is not possible, PT2 should error.

This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: False), along with an environment variable `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR` to control it.
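
A minimal sketch of turning the toggle on, assuming the flag is exposed on the inductor config under the `triton` namespace (the exact location is inferred from the description above):

```python
import os

# Option 1: the environment variable mentioned above, set before compilation.
os.environ["TORCHINDUCTOR_CUDAGRAPH_OR_ERROR"] = "1"

import torch
import torch._inductor.config as inductor_config

# Option 2: the config flag directly (assumed to live under the `triton` namespace).
inductor_config.triton.cudagraph_or_error = True

@torch.compile(mode="reduce-overhead")  # mode that uses cudagraphs
def f(x):
    return x.sin() + 1

out = f(torch.randn(8, device="cuda"))  # errors if any op cannot be captured
```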

Pull Request resolved: pytorch#161862
Approved by: https://github.com/ezyang, https://github.com/mlazos
…ytorch#162044)"

This reverts commit cd529b6.

Reverted pytorch#162044 on behalf of https://github.com/jeffdaily due to mi200 backlog is purged, and mi300 runners are failing in GHA download ([comment](pytorch#162044 (comment)))
…h#161907)

`CMAKE_PREFIX_PATH` is a list of paths used to find dependencies. The test overwrites it with a single path, causing dependencies such as protobuf or Abseil not to be found.

Instead, prepend the path to the existing value.
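
The prepend pattern in environment-variable form (a sketch; the extra path is illustrative):

```python
import os

extra_prefix = "/tmp/aoti_test_install"  # illustrative path to prepend

existing = os.environ.get("CMAKE_PREFIX_PATH", "")
# Prepend instead of overwriting, so protobuf, Abseil, etc. remain discoverable.
os.environ["CMAKE_PREFIX_PATH"] = (
    extra_prefix + (os.pathsep + existing if existing else "")
)
```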

This fixes a test failure:
> pytorch-v2.7.1/test/inductor/test_aot_inductor_package.py", line 242, in test_compile_after_package
>    self.assertTrue(so_path.exists())
> AssertionError: False is not true

Caused by:
```
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::utility: No such file or directory
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::variant: No such file or directory
collect2: error: ld returned 1 exit status
```

Pull Request resolved: pytorch#161907
Approved by: https://github.com/Skylion007
I found a number of places that seem to want forwarding references, but the type signatures do not reflect that.

Pull Request resolved: pytorch#161094
Approved by: https://github.com/malfet
…ch#158747)

This is a part of our effort for integrating Composable Kernel library for Inductor backend. Currently we have a submodule, but would prefer to have commit pin control over the library as with Triton. We intentionally avoid putting all installation logic in CI scripts to allow locally built versions to have this functionality.

The idea is to have CK as a pytorch dependency in pytorch 2.9 release to allow people to use it with inductor and AOT inductor and then gradually step away from submodule usage. Right now CK usage in SDPA/Gemm is tied to submodule files.

This PR is a remake of pytorch#156192, due to a branch error.

Pull Request resolved: pytorch#158747
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <[email protected]>
Co-authored-by: Jack Taylor <[email protected]>
Co-authored-by: Max Podkorytov <[email protected]>
Co-authored-by: Copilot <[email protected]>
[PEP 735](https://peps.python.org/pep-0735) introduces the [dependency-groups] table for a number of use cases, one of which is specifying development dependencies for projects.

Pull Request resolved: pytorch#161216
Approved by: https://github.com/seemethere
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](intel/torch-xpu-ops@83c5a5a), which includes:

- Revert "Disable xccl timer avoid drlm hang" because the XPU time event issue has been fixed
- Fallback lu_factor kernel to CPU for single batch
- Enable aten::linalg_inv and aten::linalg_inv_ex on XPU
Pull Request resolved: pytorch#162062
Approved by: https://github.com/EikanWang
)

This PR implements the semantics change to `torch._dynamo.error_on_graph_break`:
- ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~
- `error_on_graph_break` is a new internal `torch.compile` setting that is lower priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks.
- `error_on_graph_break` does nothing when `fullgraph=True`
- `error_on_graph_break` does NOT guarantee a single graph

Follow-up [DONE]: the programming model docs need to be updated to reflect the three graph-break modes for compilation:
- `fullgraph=True`: enforce one graph, no graph breaks; cannot be toggled
- `fullgraph=False, error_on_graph_break=True`: error on graph breaks; `error_on_graph_break` can be toggled at compile time
- `fullgraph=False, error_on_graph_break=False`: resume tracing on graph breaks; `error_on_graph_break` can be toggled at compile time
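
A sketch of the three modes (only the `fullgraph` kwarg and the defaults are taken from the description above; the exact spelling of the toggle is an assumption):

```python
import torch

def fn(x):
    torch._dynamo.graph_break()  # deliberate graph break for illustration
    return x + 1

x = torch.randn(4)

# fullgraph=True: always errors at the break; cannot be toggled.
# torch.compile(fn, fullgraph=True)(x)  # would raise

# fullgraph=False with error_on_graph_break left False (default): tracing
# resumes after the break, producing more than one graph.
torch.compile(fn, fullgraph=False)(x)

# fullgraph=False with error_on_graph_break toggled on; the toggle is assumed
# to be usable roughly like this, per the torch._dynamo.error_on_graph_break
# name in the description:
# with torch._dynamo.error_on_graph_break(True):
#     torch.compile(fn, fullgraph=False)(x)  # would raise at the graph break
```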

Pull Request resolved: pytorch#161747
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#161739
…he CUDACachingAllocator (pytorch#158352)

## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (CUDA operations are not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG of work (we call it the **capturing graph**). We can use the capturing graph to determine when a block's old lifetime is fully ordered before future work, and safely reuse the block within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.
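
A hedged sketch of opting in, assuming the flag is surfaced through `PYTORCH_CUDA_ALLOC_CONF` like other caching-allocator options (the exact plumbing is not spelled out here):

```python
import os

# Assumption: the experimental flag is passed like other caching-allocator options.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "graph_capture_record_stream_reuse:True"

import torch

static_in = torch.randn(1 << 20, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    tmp = static_in * 2   # allocated during capture
    out = tmp.sum()
    del tmp               # freed during capture; with the flag on, the block may be
                          # reused within this same capture if the per-stream rule
                          # described below can be proven
g.replay()
```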

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or of the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.
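
As a small illustration, the per-stream check boils down to a predicate like this (pure-Python sketch; `is_predecessor` stands in for reachability in the capturing graph):

```python
def reusable_on_stream(free_markers, terminal_of_s, is_predecessor):
    # Per-stream rule: every free marker must be a predecessor of every node
    # in the terminal set of the allocation stream S.
    return all(
        is_predecessor(marker, terminal)
        for marker in free_markers
        for terminal in terminal_of_s
    )
```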

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](pytorch#158352 (review)).

In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot tell that its actual last use may have occurred much earlier, long before the join, so we must wait for the join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: pytorch#158352
Approved by: https://github.com/ngimel, https://github.com/eqy

Co-authored-by: Jeff Daily <[email protected]>
…orch#161984)

Added a helper API to tell whether the world is entirely within a P2P domain or crosses the network.
This is mainly for nblocks tuning purposes (in later PRs).

Pull Request resolved: pytorch#161984
Approved by: https://github.com/ngimel
ghstack dependencies: pytorch#161983
so that the signal calls do not step on each other's toes.

Pull Request resolved: pytorch#162026
Approved by: https://github.com/ngimel
…161407)

Summary:

Creates a fallback path for `torch._grouped_mm`, using the naive for
loop implementation (or bmm).
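
For intuition, a hedged sketch of what such a naive fallback computes for the 3d-3d case (shapes, dtype, and device are illustrative; this is not the kernel added by the PR):

```python
import torch

def naive_grouped_mm_3d_3d(mat_a, mat_b):
    # mat_a: [G, M, K], mat_b: [G, K, N] -> out: [G, M, N], one mm per group.
    return torch.stack([torch.mm(a, b) for a, b in zip(mat_a, mat_b)])

a = torch.randn(4, 8, 16, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4, 16, 32, device="cuda", dtype=torch.bfloat16)

# For this case the loop matches a batched matmul.
torch.testing.assert_close(naive_grouped_mm_3d_3d(a, b), torch.bmm(a, b))
```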

For the sake of keeping the PR small, this PR only enables SM80+ (CUDA
capability 8.0 and up), since I am testing this on an A100 machine. In
future PRs, we can increase the coverage of the fallback to:
1. float32 and float16, which will extend the GPU coverage
2. cpu

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: pytorch#161407
Approved by: https://github.com/drisspg, https://github.com/eqy
…61717)

Summary:

Moves the `torch._grouped_mm` fallback from cuda-only code to a place
where it can be used by multiple backends. Specifically:
1. make the fallback path and util functions reusable and move them to
   `ATen/native/GroupedMMUtils.h`
2. register a backend-agnostic kernel to composite explicit autograd key
3. refactor the grouped_mm tests to their own test case and enable CPU

At the end of this PR, here is the support matrix:
* CUDA SM90+: fast path with test coverage (no change)
* CUDA SM80+: fallback with test coverage (no change)
* CPU: fallback works, but without test coverage (new in this PR)
* other SM versions and other backends: will probably already work, but
  let's leave this to future PRs
* float32/float16: will probably already work, but let's leave this to
  future PRs

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: pytorch#161717
Approved by: https://github.com/ngimel, https://github.com/drisspg
ghstack dependencies: pytorch#161407
…62059)

Summary:

Enables `torch.float32` and `torch.float16` options in
`torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`,
`mat_b`, and `out_dtype` are `torch.bfloat16`.

Saving for future PRs:
1. enabling testing on more platforms
2. supporting out_dtype != mat_a.dtype
3. opinfo
4. better compile support

Test Plan:

```bash
// on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
// on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: pytorch#162059
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: pytorch#161407, pytorch#161717
I don't have a failing test case, but I just saw an extra guard somewhere.

Pull Request resolved: pytorch#162105
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
…pytorch#161688)

Fixes pytorch#161080
`torch.export.export` fails with `TypeError: expand() got an unexpected keyword argument 'implicit'` when calling `torch.expand_copy(..., implicit=True)`. This happened because `expand_copy = _make_copy_from_view(aten.expand)` registers `aten.expand` as the decomposition path for `aten.expand_copy`, and `aten.expand` doesn't accept the `implicit` argument.

I have added an explicit decomposition for `aten.expand_copy` in `torch/_decomp/decompositions.py` that ignores the `implicit` argument, and a simple unit test to demonstrate the bug being fixed.
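
A hedged sketch of the shape of such a decomposition (the actual registration added by the PR may differ; the check below only verifies the behavior):

```python
import torch

aten = torch.ops.aten

# Shape of the fix: accept (and ignore) `implicit`, then expand and materialize a copy.
def expand_copy_decomp(self, size, *, implicit=False):
    return aten.expand(self, size).clone()

x = torch.randn(3, 1)
torch.testing.assert_close(
    expand_copy_decomp(x, (3, 4), implicit=True),
    aten.expand_copy(x, (3, 4)),
)
```
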
Pull Request resolved: pytorch#161688
Approved by: https://github.com/angelayi, https://github.com/can-gaa-hou
pytorch#161951)

…h.is_complex.

The PR proposes adding a simple, self-explanatory example to the documentation page. The example demonstrates the function's output for tensors with various data types, showing both True and False return values.
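
In that spirit, an illustrative example (the wording on the rendered docs page may differ):

```python
import torch

torch.is_complex(torch.tensor([1 + 2j], dtype=torch.complex64))  # True
torch.is_complex(torch.tensor([1.0, 2.0], dtype=torch.float64))  # False
torch.is_complex(torch.tensor([1, 2], dtype=torch.int32))        # False
```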

Fixes pytorch#161859

Pull Request resolved: pytorch#161951
Approved by: https://github.com/zou3519
Update cpp-httplib for better error handling, bug fixes, and performance. Header-only library update.
Pull Request resolved: pytorch#162181
Approved by: https://github.com/jansel
Summary:
att
Test Plan:
ci
Rollback Plan:

Reviewed By: minjang

Differential Revision: D80828148

Pull Request resolved: pytorch#161798
Approved by: https://github.com/minjang, https://github.com/SherlockNoMad
This reverts commit 2c03f0a.

Reverted pytorch#162007 on behalf of https://github.com/jeanschmidt due to Breaks internal builds see [D81588372](https://www.internalfb.com/diff/D81588372), @malfet may you help the author? ([comment](pytorch#162007 (comment)))
This reverts commit b40d943.

Reverted pytorch#162001 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](pytorch#161999 (comment)))

rocm-repo-management-api bot commented Sep 9, 2025

Jenkins build for ab5575833f1eb9066df192dd91d9d7bd43385f65 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts


rocm-repo-management-api bot commented Sep 9, 2025

Jenkins build for 60644390d3e6c3da6228427bb14c6b759011f97a commit finished as FAILURE
Links: Blue Ocean view / Build artifacts


rocm-repo-management-api bot commented Sep 9, 2025

Jenkins build for 304889c9da6276844081450426d1722846961f6c commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony (Collaborator) left a comment

Rubber-stamping the PR. Build is successful and conflicts have been resolved.

@pragupta pragupta force-pushed the rocm7.1_internal_testing_IFU_2025-09-09 branch from 304889c to dba8539 on September 10, 2025 18:44

rocm-repo-management-api bot commented Sep 10, 2025

Jenkins build for 9a66f82081053fff8105de82cd5f4593c393caf1 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pragupta pragupta changed the title from "[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-09" to "[DO NOT MERGE] [AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-09" on Sep 11, 2025
@pragupta (Collaborator, Author) commented

Do not merge yet; the vLLM team is testing with this branch, and if perf looks good we will merge it. Otherwise, wait till 9/15.

@@ -1,5 +1 @@
<<<<<<< HEAD
56765e8c1f6490e21312b46242ed78cb2dd46d35
Collaborator:

It will be updated to the new branch.

Collaborator (Author):

Yes, this will be handled separately.

if ((dims > 0) && (dims <= 2)) {
auto divmod = sizes_[0].divmod(linear_idx);
<<<<<<< HEAD
#pragma unroll
Collaborator:

If we merge this line, then at the next IFU it will not show up as a conflict.

Collaborator (Author):

Sorry, I don't follow. I chose the upstream change so that the next time we do an IFU it won't show up as a merge conflict. If I choose HEAD, then we are picking local changes, which may show up as a merge conflict.

return HIP_R_4F_E2M1;
#else
<<<<<<< HEAD
// Return HIP_R_4F_E2M1 enum value for earlier ROCm version.
Collaborator:

Same comment as above.
In the current scenario it is better to merge the <<HEAD block, to avoid conflicts at the next IFU.

case ConvBackend::Miopen:
case ConvBackend::MiopenDepthwise:
case ConvBackend::MiopenTranspose:
<<<<<<< HEAD
Collaborator:

Matching upstream, it is fine.

}

// TODO: Remove PYTORCH_MIOPEN_SUGGEST_NHWC once ROCm officially supports NHWC in MIOpen
<<<<<<< HEAD
Collaborator:

Matching upstream, it is fine.

@pruthvistony (Collaborator) left a comment

On NHWC batchnorm, we need Dmitry's confirmation.

Other conflicts are resolved properly, AFAIK.

@pragupta pragupta changed the title from "[DO NOT MERGE] [AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-09" to "[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-09" on Sep 16, 2025
@jithunnair-amd (Collaborator) left a comment

LGTM. Hope other IFUs get shorter, both in time and diffs :)


rocm-repo-management-api bot commented Sep 17, 2025

Jenkins build for 9e7df766290def1ac0112fc758a6fa1ea126e95a commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected an error during base docker image building:


#30 [stage-0 22/56] COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
#30 DONE 0.1s

#31 [stage-0 23/56] RUN bash ./install_rocm_magma.sh
#31 0.262 ./install_rocm_magma.sh: line 91: syntax error: unexpected end of file
#31 ERROR: process "/bin/sh -c bash ./install_rocm_magma.sh" did not complete successfully: exit code: 2
------
 > [stage-0 23/56] RUN bash ./install_rocm_magma.sh:
0.262 ./install_rocm_magma.sh: line 91: syntax error: unexpected end of file
------

@pragupta pragupta merged commit 3a58f35 into rocm7.1_internal_testing Sep 17, 2025
44 of 46 checks passed
@pragupta pragupta deleted the rocm7.1_internal_testing_IFU_2025-09-09 branch September 17, 2025 11:51
@pragupta pragupta restored the rocm7.1_internal_testing_IFU_2025-09-09 branch September 24, 2025 20:07
@pragupta pragupta deleted the rocm7.1_internal_testing_IFU_2025-09-09 branch September 24, 2025 21:27